Code
df.long <- read.csv("data/dataLong.csv")Read the data that we prepared previously.
df.long <- read.csv("data/dataLong.csv")Let’s load packages/libraries.
# these are mostly for data management/wrangling and visualization
library(tidyverse) # for most things
library(foreign) # for reading SPSS files
library(readxl) # read MS Excel files
library(showtext) # get fonts
library(glue) # simplifies mixing text and code in figures and tables
library(arrow) # support for efficient file formats
library(grateful) # create table+references for packages used in a project
library(styler) # only a one-time installation (it is an Rstudio plugin)
library(car) # for car::recode only
library(skimr) # data skimming
library(lubridate) # for handling dates in data
library(janitor) # for many things in data cleaning
# these are mostly for data analysis and visualization
library(gtsummary)
library(scales)
library(visdat)
library(psych)
library(lme4)
library(nlme)
library(broom.mixed)
library(patchwork)
library(easystats)
library(mice)
library(modelsummary)
library(ggdist)
library(kableExtra)
library(formattable)
library(ggrepel)
library(ggrain)
library(sjPlot)Define a ggplot theme theme_ki(), a standard table function, kbl_ki(), and a color palette based on KI’s design guide, ki_color_palette.
source("ki.R") # this reads an external file and loads whatever is in itdf.long %>%
filter(Group == "Control") %>%
ggplot(aes(x = value, fill = measure)) +
geom_histogram(binwidth = 4) +
facet_grid(measure~time) +
theme_ki() +
scale_fill_manual(values = ki_color_palette) +
labs(title = "Outcomes over time",
subtitle = "Control group",
x = "",
y = "Number of respondents")df.long %>%
filter(!Group == "Control") %>%
ggplot(aes(x = value, fill = measure)) +
geom_histogram(binwidth = 4) +
facet_grid(measure~time) +
theme_ki() +
scale_fill_manual(values = ki_color_palette) +
labs(title = "Outcomes over time",
subtitle = "Intervention group",
x = "",
y = "Number of respondents")df.long %>%
filter(Group == "Control") %>%
ggplot(aes(x = value, fill = measure)) +
geom_density() +
facet_grid(measure~time) +
theme_ki() +
scale_fill_manual(values = ki_color_palette) +
labs(title = "Outcomes over time",
subtitle = "Control group",
x = "",
y = "Density of respondents")df.long %>%
filter(!Group == "Control") %>%
ggplot(aes(x = value, fill = measure)) +
geom_density() +
facet_grid(measure~time) +
theme_ki() +
scale_fill_manual(values = ki_color_palette) +
labs(title = "Outcomes over time",
subtitle = "Intervention group",
x = "",
y = "Density of respondents")df.long %>%
filter(measure == "DEP") %>%
ggplot(aes(x = value, fill = Group)) +
stat_halfeye() +
facet_grid(Group~time) +
theme_ki() +
#scale_fill_manual(values = ki_color_palette) +
labs(title = "Depression outcomes over time",
x = "",
y = "Density of respondents")What do these distributions look like to you?
df.long %>%
mutate(time = as.factor(time)) %>%
ggplot(aes(x = time, y = value, fill = Group)) +
geom_violin(position = position_dodge(0.9),
alpha = 0.9) +
geom_boxplot(position = position_dodge(0.9),
color = "white",
width = .2,
notch = TRUE,
outlier.shape = NA) +
facet_wrap(~measure,
ncol = 1) +
theme_ki() +
labs(title = "Outcomes over time",
x = "Time point",
y = "Distribution of outcome measurements")Notice the notch in the boxplot. This is from the documentation (?geom_boxplot): > If FALSE (default) make a standard box plot. If TRUE, make a notched box plot. Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different.
Using the package ggrain we can get this sweet figure that combines a boxplot, a half/split violin plot, jittered points for individuals, and lines between individuals across time points!
df.long %>%
filter(measure == "DEP") %>%
ggplot(aes(x = time, y = value, group = time, fill = Group, color = Group)) +
geom_rain(
boxplot.args = list(color = "black", outlier.shape = NA),
id.long.var = "id"
) +
scale_fill_brewer(palette = "Dark2",
aesthetics = c("color","fill")) +
theme_ki() +
facet_wrap(~Group,
nrow = 2) +
labs(title = "Depression over time")And a slightly different version - can you see what is different and how it relates to the code?
df.long %>%
filter(measure == "DEP") %>%
ggplot(aes(x = time, y = value, group = time, fill = Group, color = factor(id))) +
geom_rain(
alpha = .6,
boxplot.args = list(color = "black", outlier.shape = NA),
id.long.var = "id"
) +
scale_fill_brewer(palette = "Dark2") +
scale_color_viridis_d(guide = "none") +
theme_ki() +
facet_wrap(~Group,
nrow = 2) +
labs(title = "Depression over time")df.long %>%
filter(measure == "DEP") %>%
group_by(time,Group) %>%
reframe(Mean = mean(value, na.rm = T),
SD = sd(value, na.rm = T)) %>%
ggplot(aes(x = time,
y = Mean,
group = Group,
color = Group)) +
geom_point(size = 3) +
geom_line(linewidth = 1.3) +
theme_ki() +
scale_y_continuous(limits = c(0,42)) +
geom_ribbon(aes(ymin = Mean-SD,
ymax = Mean+SD,
fill = Group),
alpha = 0.1,
linetype = 0) +
labs(y = "Mean Depression Score",
x = "Time point",
caption = "Note. Shaded area indicates one standard deviation.") +
theme(plot.caption = element_text(hjust = 0))df.long %>%
filter(measure == "DEP") %>%
group_by(time, Group) %>%
reframe(
Mean = mean(value, na.rm = T),
SD = sd(value, na.rm = T)
) %>%
ggplot(aes(
x = time,
y = Mean,
group = Group,
color = Group
)) +
geom_point(
size = 3,
position = position_dodge(.5)
) +
geom_line(
linewidth = 1.3,
position = position_dodge(.5)
) +
theme_ki() +
scale_y_continuous(limits = c(0, 42)) +
geom_errorbar(
aes(
ymin = Mean - SD,
ymax = Mean + SD
),
position = position_dodge(.5), width = .2
) +
labs( # title = "Depression over time by group",
y = "Mean Depression Score",
x = "Time point",
caption = "Note. Error bars indicate one standard deviation."
) +
theme(plot.caption = element_text(hjust = 0))We will start with DEP as outcome and fit a linear model.
First, we’ll split our measure variable into three separate variables (while retaining time as its own variable), using pivot_wider().
Second, we’ll remove the last time point in this first model, in order to make our data fit a linear model better.
Both of these transformations are made in the code chunk below.
df.model1 <- df.long %>%
filter(!time == "3") %>% # since time is a character variable in df.long we need ""
pivot_wider(names_from = "measure",
values_from = "value") %>%
rename(Depression = DEP,
Anxiety = ANX,
Stress = STRESS) %>%
mutate(time = as.integer(time))Now we can specify and fit a model:
m1 <- lmer(data = df.model1,
Depression ~ time + Group + time*Group + (1 | id),
REML = TRUE)Can you decipher the syntax (before looking below)?
lmer
The general trick is that the formula follows the form -
dependent ~ independent | grouping. The grouping is generally a random factor, you can include fixed factors without any grouping and you can have additional random factors without any fixed factor (an intercept-only model). A+between factors indicates no interaction, a*indicates interaction.
For random factors, you have three basic variants:
- Intercepts only by random factor:
(1 | random.factor)- Slopes only by random factor:
(0 + fixed.factor | random.factor)- Intercepts and slopes by random factor:
(1 + fixed.factor | random.factor)
Note that variant 3 has the slope and the intercept calculated in the same grouping, i.e. at the same time. If we want the slope and the intercept calculated independently, i.e. without any assumed correlation between the two, we need a fourth variant:
Intercept and slope, separately, by random factor:
(1 | random.factor) + (0 + fixed.factor | random.factor). An alternative way to write this is using the double-bar notationfixed.factor + (fixed.factor || random.factor).
Reference links for formulas:
Our linear mixed model makes a lot of assumptions about the structure of data. It is important that we investigate this before looking at the output of the model, which may be entirely misleading if assumptions are not met.
check_model(m1)tab_model(m1)| Depression | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 16.91 | 14.17 – 19.65 | <0.001 |
| time | -0.74 | -1.86 – 0.38 | 0.196 |
| Group [MiCBT] | -0.75 | -4.72 – 3.22 | 0.709 |
| time × Group [MiCBT] | -2.72 | -4.38 – -1.06 | 0.001 |
| Random Effects | |||
| σ2 | 31.09 | ||
| τ00id | 80.29 | ||
| ICC | 0.72 | ||
| N id | 106 | ||
| Observations | 282 | ||
| Marginal R2 / Conditional R2 | 0.055 / 0.736 | ||
See https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html for more details on generating tables with sjPlot functions. One drawback with this package is that it only generates HTML-tables, which don’t work well with PDF and Word output formats.
An alternative is gtsummary, which is more flexible regarding the output, but may need a bit more work. See https://www.danieldsjoberg.com/gtsummary/index.html for examples.
tbl_regression(m1)| Characteristic | Beta | 95% CI1 |
|---|---|---|
| time | -0.74 | -1.9, 0.38 |
| Group | ||
| Control | — | — |
| MiCBT | -0.75 | -4.7, 3.2 |
| time * Group | ||
| time * MiCBT | -2.7 | -4.4, -1.1 |
| 1 CI = Confidence Interval | ||
You can also get “raw” output.
summary(m1)Linear mixed model fit by REML ['lmerMod']
Formula: Depression ~ time + Group + time * Group + (1 | id)
Data: df.model1
REML criterion at convergence: 1974.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.3880 -0.5395 -0.0722 0.4928 3.3691
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 80.29 8.960
Residual 31.09 5.576
Number of obs: 282, groups: id, 106
Fixed effects:
Estimate Std. Error t value
(Intercept) 16.9085 1.3922 12.145
time -0.7405 0.5711 -1.297
GroupMiCBT -0.7535 2.0160 -0.374
time:GroupMiCBT -2.7173 0.8442 -3.219
Correlation of Fixed Effects:
(Intr) time GrMCBT
time -0.372
GroupMiCBT -0.691 0.257
tim:GrpMCBT 0.252 -0.676 -0.366
More friendly formatted with tidy() makes it easy to create a simple table.
| effect | group | term | estimate | std.error | statistic |
|---|---|---|---|---|---|
| fixed | NA | (Intercept) | 16.91 | 1.39 | 12.14 |
| fixed | NA | time | -0.74 | 0.57 | -1.30 |
| fixed | NA | GroupMiCBT | -0.75 | 2.02 | -0.37 |
| fixed | NA | time:GroupMiCBT | -2.72 | 0.84 | -3.22 |
| ran_pars | id | sd__(Intercept) | 8.96 | NA | NA |
| ran_pars | Residual | sd__Observation | 5.58 | NA | NA |
And some additional summary stats.
This uses functions from easystats. You can look at the link for other examples.
plot(parameters(m1))This function autogenerates text describing the model output.
report(m1)We fitted a linear mixed model (estimated using REML and nloptwrap optimizer)
to predict Depression with time and Group (formula: Depression ~ time + Group +
time * Group). The model included id as random effect (formula: ~1 | id). The
model's total explanatory power is substantial (conditional R2 = 0.74) and the
part related to the fixed effects alone (marginal R2) is of 0.06. The model's
intercept, corresponding to time = 0 and Group = Control, is at 16.91 (95% CI
[14.17, 19.65], t(276) = 12.14, p < .001). Within this model:
- The effect of time is statistically non-significant and negative (beta =
-0.74, 95% CI [-1.86, 0.38], t(276) = -1.30, p = 0.196; Std. beta = -0.06, 95%
CI [-0.14, 0.03])
- The effect of Group [MiCBT] is statistically non-significant and negative
(beta = -0.75, 95% CI [-4.72, 3.22], t(276) = -0.37, p = 0.709; Std. beta =
-0.30, 95% CI [-0.65, 0.04])
- The effect of time × Group [MiCBT] is statistically significant and negative
(beta = -2.72, 95% CI [-4.38, -1.06], t(276) = -3.22, p = 0.001; Std. beta =
-0.21, 95% CI [-0.33, -0.08])
Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.
What happens if we define time as a factor and add time point 3?
Add random slopes?
Add pre-measurement covariate?
This package is really great and has nice documentation: https://vincentarelbundock.github.io/marginaleffects/
This series of blog posts on GLM’s and causal inference is awesome:
https://solomonkurz.netlify.app/blog/2023-04-12-boost-your-power-with-baseline-covariates/